Sparse solver on GPU #600
Conversation
@cameronrutherford: there is a Spack build failure for all the PRs/branches, see #601
@cnpetra it seems we are running out of space on PNNL platforms.
This looks very good and is a sound way to enable GPU-resident optimization with sparse linear algebra.
I am not sure if using inheritance CUSOLVERGPU -> CUSOLVER is the best approach here, even if this is just a temporary solution as @nychiang suggests. It will lead to a lot of duplicate code, which we will need to remove later on.
The current implementation of the cusolver-lu module already works with data on the device. It requires minimal modification to receive data in the memory space specified by HiOp. I suggest we work directly with the hiopLinSolverSymSparseCUSOLVER class; going through a derived class seems unnecessary.
This PR is supposed to only take care of GPU porting up to the linear solver(s). cuSolver was chosen somewhat arbitrarily, just to ensure the rest of the kernels run fine. The porting of the linear solvers is a bit more involved and is better done later on.
This PR looks great. How do you want to proceed re: PR #589? Which one should be merged first?
Maybe merge that one first. Also, PNNL CI is not passing.
I tried to remove all the member objects from RAJA, but it still fails on Marianas. Not sure which part causes the failure.
// swap M_ and M_host_ so that M_ points to the other copy of the KKT matrix
hiopMatrixSparse* swap_ptr = M_;
M_ = M_host_;
M_host_ = swap_ptr;
If the sparse matrix is implemented correctly, you really should not have to do this. The matrix class needs to provide both the device data and the host mirror, as well as methods to sync the two.
hiopMatrixSparse has only one member function, M(), to access its data, while hiopMatrixRajaSparseTriplet has accessors M() and M_host() for the data on device and host, respectively (see here).
However, the problem is that previously the sparse solver only worked in CPU and hybrid modes. It assumed that the KKT matrix M_ is always on the host and has type hiopMatrixSparseTriplet. Therefore, in the existing implementation of hiopLinSolverSparseCUSOLVER.cpp, there are many places that use the host data M_->M() and then transfer it to the device for hybrid usage (for example, see here). The current implementation cannot directly work with a matrix on the device.
If we want to use the existing hiopLinSolverSparseCUSOLVER without adding a new wrapper class, I need to dynamic_cast M_ to hiopMatrixRajaSparseTriplet first, sync the data, and change M_->M() to M_->M_host() in a couple of places. This seems like more work than adding a new wrapper class.
As a result, we think creating a new wrapper class minimizes the changes to your code and makes them easier to remove once the linear solver can support data on the device.
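For reference, a rough sketch of the alternative described above (not the approach taken in this PR); the header name and the need for an explicit sync step are assumptions, while the class and accessor names come from this thread:

#include "hiopMatrixRajaSparseTriplet.hpp"   // assumed header for the RAJA triplet class

// Reuse the existing hiopLinSolverSparseCUSOLVER by casting the KKT matrix to
// the RAJA sparse triplet type and reading its host mirror.
double* kkt_values_on_host(hiop::hiopMatrixSparse* M_)
{
  auto* M_raja = dynamic_cast<hiop::hiopMatrixRajaSparseTriplet*>(M_);
  if(M_raja == nullptr) {
    return nullptr;                           // matrix is not the RAJA (device) variant
  }
  // a device-to-host sync of the values would be needed here (exact API not shown),
  // after which M_host() replaces the M_->M() calls used by the existing solver
  return M_raja->M_host();
}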
The linear solver is already ported to GPU, so I am not sure what else needs to be done there. Regardless of that, creating a …
@pelesh As @cnpetra mentioned, right now I am working on GPU porting up to the linear solver(s).
Note that GPU sparse linear solvers need to set up and run the first factorization on the host before moving all data to the GPU. This is something the sparse HiOp module needs to account for as well, and it differs from MDS, where data can be moved to the GPU right after the initial setup. This is not specific to the sparse linear solver: for a sparse matrix-matrix product, you would also want to run the first iteration on the host to compute the sparsity pattern of the product there, and then reuse it for subsequent iterations on the GPU.
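For reference, here is a minimal sketch of that host-first workflow using cuSOLVER's refactorization API (cusolverRf). It assumes the first LU factorization (factors L and U with permutations P and Q) has already been computed on the host by a CPU solver such as KLU, and it omits error checking; it illustrates the pattern, not the code in this PR.

#include <cusolverRf.h>

// Host CSR data for the KKT matrix A and its host-computed factors L and U,
// plus the row/column permutations P and Q, are assumed to be available.
void refactorize_on_gpu(int n,
                        int nnzA, int* csrRowPtrA, int* csrColIndA, double* csrValA,
                        int nnzL, int* csrRowPtrL, int* csrColIndL, double* csrValL,
                        int nnzU, int* csrRowPtrU, int* csrColIndU, double* csrValU,
                        int* P, int* Q)
{
  cusolverRfHandle_t rf;
  cusolverRfCreate(&rf);

  // 1. Hand the host factorization (pattern + first numeric values) to cusolverRf.
  cusolverRfSetupHost(n,
                      nnzA, csrRowPtrA, csrColIndA, csrValA,
                      nnzL, csrRowPtrL, csrColIndL, csrValL,
                      nnzU, csrRowPtrU, csrColIndU, csrValU,
                      P, Q, rf);

  // 2. Symbolic analysis on the device; after this the sparsity pattern is fixed
  //    and subsequent iterations can stay entirely on the GPU.
  cusolverRfAnalyze(rf);

  // 3. Numeric refactorization on the GPU; in an IPM loop this step (preceded by
  //    cusolverRfResetValues with updated matrix entries) is repeated every iteration.
  cusolverRfRefactor(rf);

  cusolverRfDestroy(rf);
}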
MJacS[0] = 4.0;
MJacS[1] = 2.0;
// --- constraint 2 body ---> 2*x_1 + x_3
MJacS[2] = 2.0;
MJacS[3] = 1.0;
I am not sure this makes sense. Consider rewriting as:
if(itrow == 0) {
  MJacS[0] = 4.0;
  MJacS[1] = 2.0;
  // --- constraint 2 body ---> 2*x_1 + x_3
  MJacS[2] = 2.0;
  MJacS[3] = 1.0;
}
Actually, this is better added on the host and then copied to GPU, since these are just constant Jacobian elements that stay the same throughout the computation.
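A minimal sketch of that suggestion, assuming MJacS_dev points to a device buffer (the function and variable names are illustrative, not from the PR):

#include <cuda_runtime.h>

// Fill the constant Jacobian entries once on the host and copy them to the
// device buffer; they never change during the optimization.
void init_constant_jacobian(double* MJacS_dev)
{
  const double MJacS_host[4] = {4.0, 2.0, 2.0, 1.0};
  cudaMemcpy(MJacS_dev, MJacS_host, 4 * sizeof(double), cudaMemcpyHostToDevice);
}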
The if condition is not required here, since the RAJA::RangeSegment only contains one valid index.
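A minimal illustration of that point, with an assumed execution policy and a device-resident MJacS (not the exact kernel from this PR): with a single-element RangeSegment the lambda body runs exactly once, so an if(itrow == 0) guard is redundant.

#include <RAJA/RAJA.hpp>

void set_constant_jacobian(double* MJacS) // MJacS assumed to be a device pointer
{
  RAJA::forall<RAJA::cuda_exec<128>>(
    RAJA::RangeSegment(0, 1),                      // single valid index: itrow == 0
    [=] RAJA_DEVICE (RAJA::Index_type /*itrow*/)
    {
      MJacS[0] = 4.0;
      MJacS[1] = 2.0;
      // --- constraint 2 body ---> 2*x_1 + x_3
      MJacS[2] = 2.0;
      MJacS[3] = 1.0;
    });
}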
One missing piece in this PR is a flag that would inform HiOp when to move data to the device. Unlike in the dense linear algebra case, data is not moved to the GPU after the initial setup, but only after the sparsity patterns for the matrix-matrix products and the matrix factors are computed. The flag should be set to true once the HiOp sparse module and the linear solver have computed the sparsity patterns.
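Purely as an illustration of the suggested flag (the class, member, and method names below are hypothetical, not HiOp API):

class SparseKktGpuPort {
  bool data_on_device_ = false;          // true once all sparsity patterns are known
public:
  void firstFactorization() {
    // ... compute sparsity patterns of matrix-matrix products and factors on the host ...
    data_on_device_ = true;              // from now on, keep matrices and factors on the GPU
  }
  bool dataOnDevice() const { return data_on_device_; }
};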
This is pretty much how it should work in the long term. I am not sure that porting the initialization to the device will buy you much; it will likely be a performance loss. Most setup operations are not SIMD and perform far better on the CPU.
When mem_space is gpu, the linear solver class can/should create data copies on the CPU if it needs to. See for example: https://github.com/LLNL/hiop/blob/develop/src/LinAlg/hiopLinSolverCholCuSparse.cpp#L228
Such data movements should be handled by the linear solver class. And there are sparse linear solvers currently in HiOp that do work fully on the GPU (though under different options they may do CPU work, like computing the minimum fill-in of the factors in the Cholesky class; see the comment above).
This PR shows that the sparse solver can run under GPU mode, i.e., mem_space=device and compute_mode=gpu. I created a new example, NlpSparseRajaEx2, in this PR. We should remove the class hiopLinSolverSymSparseCUSOLVERGPU once we have a linear solver that can work with data on the device.
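A hypothetical driver fragment for that GPU mode (the option names appear in this thread; the interface object and the rest of the setup are assumptions modeled on HiOp's sparse example drivers, not the actual NlpSparseRajaEx2 code):

#include "hiopInterface.hpp"
#include "hiopNlpFormulation.hpp"
#include "hiopAlgFilterIPM.hpp"

int solve_on_gpu(hiop::hiopInterfaceSparse& nlp_interface)
{
  hiop::hiopNlpSparse nlp(nlp_interface);
  nlp.options->SetStringValue("mem_space", "device");   // keep NLP data in device memory
  nlp.options->SetStringValue("compute_mode", "gpu");   // run the IPM kernels on the GPU

  hiop::hiopAlgFilterIPMNewton solver(&nlp);
  hiop::hiopSolveStatus status = solver.run();
  return (status == hiop::Solve_Success) ? 0 : 1;
}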